Existing procedures for model validation have been deemed inadequate for many
engineering systems. This inadequacy is due to the high degree of complexity of
the physical mechanisms that govern these systems. It is proposed in this paper
to shift the attention from modeling the engineering system itself to modeling
the uncertainty that underlies its behavior. A mathematical framework for
modeling the uncertainty in complex engineering systems is developed. This
framework uses the results of computational learning theory. It is based on the
premise that a system model is a learning machine.

Categories and Subject Descriptors: I.2.6 [Artificial Intelligence]: Learning - Parameter
learning, Induction; I.6.4 [Simulation and Modeling]: Model Validation and Analysis; J.2 [Physical
ity and Statistics]

1. INTRODUCTION

Modeling of engineering systems such as wastewater treatment plants, groundwater
contaminant transport, membrane fouling, sediment transport phenomena, ..., is
traditionally carried out in three sequential steps:

(i) model development: the modeler collects the available knowledge about the
studied system $S$ in the form of first principles, empirical laws and/or
heuristic hypotheses. Based on this knowledge, the modeler develops a set of
mathematical relationships (i.e., the system model $\mathcal{M}$) among the
system state variables, which can generally be written in the form of a
differential equation:

\[ \dot{\mathbf{x}} = \mathbf{f}(t, \mathbf{x}, \mathbf{p}) \tag{1} \]

where $t$ is the time, $\mathbf{x}$ is the system state vector, $\mathbf{p}$ is
the model parameter vector and $\mathbf{f}$ is a generally nonlinear
mathematical function.

(ii) model identification: after the model is developed, the modeler uses a set
$\Upsilon_N$ ($N \in \mathbb{N}^{*}$) of empirical data:

\[ \Upsilon_N : \ \mathbf{x}^{\mathrm{data}}(t_1),\ \mathbf{x}^{\mathrm{data}}(t_2),\ \ldots,\ \mathbf{x}^{\mathrm{data}}(t_N) \tag{2} \]

collected from the real operation of the system, to identify the model
parameters. This step usually requires the minimization of an objective function
$J(\mathbf{p})$ of the form:

\[ J(\mathbf{p}) = \sum_{k=1}^{N} \bigl\| \mathbf{x}(\mathbf{p}, t_k) - \mathbf{x}^{\mathrm{data}}(t_k) \bigr\|^2 \tag{3} \]

where $\mathbf{x}(\mathbf{p}, t)$ represents the solution to the model equation
(1). In most cases, the data set $\Upsilon_N$ would actually be divided into two
subsets $\Upsilon_{N_1}$ and $\Upsilon_{N_2}$ ($N = N_1 + N_2$). The first subset
(called identification sample) is used for the model parameter vector
identification, and the second (called validation sample) for model validation
(step iii below).

(iii) model validation: in this step, the identified system model is tested on
the validation subset $\Upsilon_{N_2}$ that it has never "seen". If the model
performs well on this sample, then it is retained. Otherwise, the model
structure is adjusted and the validation procedure repeated.
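
The three steps above can be sketched numerically. The following is only a minimal illustration under stated assumptions, not the procedure of any particular study: it assumes a hypothetical scalar model dx/dt = -p x (whose solution x(p, t) is known in closed form), synthetic noisy data, and a plain grid search as a stand-in for the minimization of J(p). All names (`model_solution`, `identify`, `p_true`, ...) are illustrative.

```python
import numpy as np

def model_solution(p, t, x0=1.0):
    """Closed-form solution x(p, t) of the hypothetical model dx/dt = -p * x."""
    return x0 * np.exp(-p * t)

def objective(p, t, x_data):
    """Sum of squared residuals J(p), in the spirit of equation (3)."""
    return float(np.sum((model_solution(p, t) - x_data) ** 2))

def identify(t, x_data, grid):
    """Return the grid point minimizing J(p) (a crude stand-in for an optimizer)."""
    costs = [objective(p, t, x_data) for p in grid]
    return float(grid[int(np.argmin(costs))])

rng = np.random.default_rng(0)
p_true = 0.7                               # assumed value, used only to synthesize data
t = np.linspace(0.0, 5.0, 40)
x_data = model_solution(p_true, t) + 0.01 * rng.normal(size=t.size)

# Split the N = 40 data points into identification (N1 = 30) and validation (N2 = 10) samples.
idx = rng.permutation(t.size)
ident, valid = idx[:30], idx[30:]

# Step ii: identification on the first subset.
grid = np.linspace(0.0, 2.0, 401)
p_hat = identify(t[ident], x_data[ident], grid)

# Step iii: validation on data the identified model has never "seen".
validation_mse = objective(p_hat, t[valid], x_data[valid]) / valid.size
```

In this toy setting the model structure is correct by construction, so the validation error is small; the criticism discussed next concerns systems for which no such guarantee exists.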



The foregoing model validation procedure (called cross validation) has been
criticized in many areas of engineering. In wastewater engineering, for example,
Jeppsson [1996] pointed out that, "in strict sense, model validation is
impossible" with the existing validation techniques. Similarly, Zheng and
Bennett [1995] noted that in groundwater engineering, "models, like any
scientific hypothesis, cannot be validated in the absolute sense . . . They can
only be invalidated". Konikow and Bredehoeft [1992] suggested that terms like
model verification and model validation convey a false sense of truth and
accuracy and thus should be abandoned in favor of more realistic assessment
descriptors such as history-matching and benchmarking.



The engineering systems for which the cross validation procedure is deemed
inadequate all share one feature: the mechanisms that govern each of them are so
complex that no single model can be considered to describe these mechanisms in
their entirety. The predictions of a model, no matter how sophisticated it is,
are not guaranteed to match reality. In this paper, it is proposed to shift the
attention from modeling the system itself to modeling the uncertainty that
underlies its behavior. The aim is to answer questions such as: what makes
uncertainty high or low? How can it be controlled and to what extent can it be
reduced?



A mathematical framework for modeling the uncertainty in complex engineering
systems is developed in this paper. This framework is based on the premise that
a system model is a learning machine. The model identification procedure is
viewed as a learning problem or, equivalently, an information transfer from a
finite set of real data $\Upsilon_N$ into the system model.



The framework of this paper is based on the extensive research work by Vapnik
[1982], Vapnik [1995], Vapnik [1998] and that of Vapnik and Chervonenkis [1968],
Vapnik and Chervonenkis [1981], Vapnik and Chervonenkis [1991] in the area of
mathematical statistics and its applications to computational machine learning
theory. The next section shows why and how a system model can be considered as a
learning machine. The remainder of the paper is devoted to the framework
development.

2. A SYSTEM MODEL IS A LEARNING MACHINE 

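Assume that we are interested in the variations of one state variable $x_{i_0}$
of the system $S$ and consider the model differential equation that governs the
dynamics of this variable:

\[ \dot{x}_{i_0} = f(t, \mathbf{x}, \mathbf{p}) \]

or

\[ \frac{dx_{i_0}}{dt} = f(t, \mathbf{x}, \mathbf{p}) \tag{4} \]

where $t$ is the time, $\mathbf{x}$ is the process state vector, $\mathbf{p}$ is
the parameter vector and $f$ is a real-valued function. This equation represents
one component of the vector differential equation:

\[ \dot{\mathbf{x}} = \mathbf{f}(t, \mathbf{x}, \mathbf{p}) \]

of the system model $\mathcal{M}$. However, the vectors $\mathbf{x}$ and
$\mathbf{p}$ in equation (4) do not necessarily contain all of their components.
Normally, they should be denoted as $\mathbf{x}_{x_{i_0}}$ and
$\mathbf{p}_{x_{i_0}}$ and equation (4) should become:

\[ \frac{dx_{i_0}}{dt} = f(t, \mathbf{x}_{x_{i_0}}, \mathbf{p}_{x_{i_0}}) \tag{5} \]

in order to highlight the fact that $\mathbf{x}_{x_{i_0}}$ and
$\mathbf{p}_{x_{i_0}}$ contain only those state variables and parameters,
respectively, that influence the dynamics of $x_{i_0}$.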

This study will be limited to the case of autonomous systems, i.e., systems whose
models do not depend explicitly on time. In other words, the general model
equation that governs $x_{i_0}$ can be written as:

\[ \frac{dx_{i_0}}{dt} = f(\mathbf{x}_{x_{i_0}}, \mathbf{p}_{x_{i_0}}) \tag{6} \]

In addition to $x_{i_0}$, all state variables, components of
$\mathbf{x}_{x_{i_0}}$, are assumed to be directly and separately measurable.

Using the Euler method to numerically integrate equation (6), the time is
discretized with a time step of $\Delta t$ and then $x_{i_0}$ is computed at
times

\[ t_1 = \Delta t,\quad t_2 = 2\,\Delta t,\quad \ldots,\quad t_n = n\,\Delta t,\quad \ldots \]

using the following equation:

\[ x_{i_0}(t_n) = x_{i_0}(t_{n-1}) + \Delta t\, f\bigl(\mathbf{x}_{x_{i_0}}(t_{n-1}), \mathbf{p}_{x_{i_0}}\bigr) \tag{7} \]

Define $w^{\mathcal{M}}$ as the value of $x_{i_0}$ to be predicted by the model
$\mathcal{M}$, that is:

\[ w^{\mathcal{M}} = x_{i_0}(t_n) \]

Similarly, define the vector $\mathbf{v}$ as:

\[ \mathbf{v} = \bigl[\, x_{i_0}(t_{n-1}),\ \mathbf{x}_{x_{i_0}}(t_{n-1})^{T} \,\bigr]^{T} \tag{8} \]

The superscript $T$ denotes the transpose. The number $w^{\mathcal{M}}$ takes
values from a subset $W$ of the real line $\mathbb{R}$, and the vector
$\mathbf{v}$ from a multi-dimensional space $V$.

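Now introduce the real-valued function $H$ defined as:

\[ H(\mathbf{v}, \mathbf{p}_{x_{i_0}}) = x_{i_0}(t_{n-1}) + \Delta t\, f\bigl(\mathbf{x}_{x_{i_0}}(t_{n-1}), \mathbf{p}_{x_{i_0}}\bigr) \tag{9} \]

The expression of this function corresponds to that of the right-hand side of
equation (7). The latter equation then becomes:

\[ w^{\mathcal{M}} = H(\mathbf{v}, \mathbf{p}_{x_{i_0}}) \tag{10} \]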
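
The construction of the instance v, the outcome w and the one-step predictor H can be sketched in code. This is a hedged illustration only: it assumes a hypothetical scalar autonomous model f(x, p) = -p x, so that the vector v of equation (8) reduces to the single number x(t_{n-1}); the names `f`, `H` and `make_pairs` are introduced here for illustration.

```python
import numpy as np

def f(x, p):
    """Hypothetical autonomous right-hand side; stands in for f in equation (6)."""
    return -p * x

def H(v, p, dt):
    """One-step Euler predictor, as in equation (9): x(t_{n-1}) + dt * f(x(t_{n-1}), p)."""
    return v + dt * f(v, p)

def make_pairs(series):
    """Turn a measured time series into instance/outcome couples (v, w)."""
    series = np.asarray(series, dtype=float)
    return series[:-1], series[1:]

dt = 0.1
p = 0.5

# A trajectory generated by the Euler scheme itself, so w = H(v, p) holds exactly.
x = [1.0]
for _ in range(5):
    x.append(H(x[-1], p, dt))

v, w = make_pairs(x)       # v[k] = x(t_k), w[k] = x(t_{k+1})
w_model = H(v, p, dt)      # model predictions, equation (10)
```

With real measurements the couples (v, w) would come from plant data rather than from the scheme itself, and `w_model` would differ from `w`.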

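For a fixed parameter vector $\mathbf{p}_{x_{i_0}}$, $H(\,\cdot\,, \mathbf{p}_{x_{i_0}})$
represents a mapping function from $V$ to $W$:

\[ H(\,\cdot\,, \mathbf{p}_{x_{i_0}}) : V \to W, \qquad \mathbf{v} \mapsto w^{\mathcal{M}} = H(\mathbf{v}, \mathbf{p}_{x_{i_0}}) \tag{11} \]

The parameter vector $\mathbf{p}_{x_{i_0}}$ takes values from a
multi-dimensional space denoted here as $\Gamma$. Define the functional set
$\mathcal{H}_{\mathcal{M}}$ of all mappings $H(\,\cdot\,, \mathbf{p}_{x_{i_0}})$
with $\mathbf{p}_{x_{i_0}} \in \Gamma$:

\[ \mathcal{H}_{\mathcal{M}} = \bigl\{ H(\,\cdot\,, \mathbf{p}_{x_{i_0}}) \ \big|\ \mathbf{p}_{x_{i_0}} \in \Gamma \bigr\} \tag{12} \]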

Now assume that a sequence of instances of the couple $(\mathbf{v}, w)$:

\[ \Upsilon_N : \ (\mathbf{v}_1, w_1),\ (\mathbf{v}_2, w_2),\ \ldots,\ (\mathbf{v}_N, w_N) \]

can be obtained from the real process operation, and consider an algorithm
$\mathcal{A}$ that receives the sequence $\Upsilon_N$ as input and produces a
parameter vector $(\mathbf{p}_{x_{i_0}})_{\mathrm{emp}}$ corresponding to the
function $H(\,\cdot\,, (\mathbf{p}_{x_{i_0}})_{\mathrm{emp}}) \in \mathcal{H}_{\mathcal{M}}$
that best approximates the real process response. In practice, this algorithm
corresponds to the system model identification procedure, which consists in
minimizing an objective function of the form:

\[ J(\mathbf{p}) = \sum_{k=1}^{N} \bigl| w_k - H(\mathbf{v}_k, \mathbf{p}) \bigr|^2 \tag{13} \]

or, equivalently:

\[ R_{\mathrm{emp}}(\mathbf{p}) = \frac{1}{N} \sum_{k=1}^{N} \bigl| w_k - H(\mathbf{v}_k, \mathbf{p}) \bigr|^2 \tag{14} \]

The subscript emp means "empirical" and the number
$|w_k - H(\mathbf{v}_k, \mathbf{p})|^2$ represents a measure of the loss between
the desired response $w_k$ corresponding to the vector $\mathbf{v}_k$ and the
model prediction $H(\mathbf{v}_k, \mathbf{p})$.
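
As a rough sketch of what the algorithm A does, the snippet below evaluates the empirical risk of equation (14) on a parameter grid and returns its minimizer. The model (dx/dt = -p x with an Euler step of 0.1), the synthetic data and the grid search are all assumptions made for illustration; an actual identification procedure would typically use a numerical optimizer.

```python
import numpy as np

def H(v, p, dt=0.1):
    """Euler predictor for the hypothetical model dx/dt = -p * x."""
    return v - dt * p * v

def empirical_risk(p, v, w):
    """R_emp(p) as in equation (14): mean squared one-step prediction error."""
    return float(np.mean((w - H(v, p)) ** 2))

def algorithm_A(v, w, grid):
    """Identification procedure: pick the parameter with the smallest empirical risk."""
    risks = np.array([empirical_risk(p, v, w) for p in grid])
    return float(grid[int(np.argmin(risks))])

rng = np.random.default_rng(1)
p_true = 0.8                                          # assumed, only used to synthesize data
v = rng.uniform(0.5, 2.0, size=200)                   # measured instances
w = H(v, p_true) + 0.001 * rng.normal(size=v.size)    # noisy measured outcomes

grid = np.linspace(0.0, 2.0, 201)
p_emp = algorithm_A(v, w, grid)
```

Minimizing J(p) of equation (13) and minimizing R_emp(p) select the same parameter, since the two differ only by the constant factor 1/N.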

A set of mapping functions equipped with an algorithm such as $\mathcal{A}$ is
called a learning machine in the area of artificial intelligence and
computational learning theory. We have thus shown that the couple
$\mathcal{LM}_S = (\mathcal{H}_{\mathcal{M}}, \mathcal{A})$, composed of a
system model and an identification procedure, can be viewed as a learning
machine. On the basis of this result, it is possible to develop a mathematical
framework that will allow us to model the uncertainty that underlies the
behavior of the engineering system $S$. The next sections of this paper are
about the development of such a framework.

Remark: Note that training of the machine

\[ \mathcal{LM}_S = (\mathcal{H}_{\mathcal{M}}, \mathcal{A}) \]

associated with the system $S$ is carried out for a specific time $t_n$. This
time is arbitrary, but fixed. The examples
$(\mathbf{v}_1, w_1), (\mathbf{v}_2, w_2), \ldots, (\mathbf{v}_N, w_N)$ to be
used for machine training should therefore correspond to a series of
realizations of the system at time $t_n$. In practice, this is not possible,
because the instance vector $\mathbf{v}$ and the outcome $w$ are measured only
once at any time instant $t$. What is obtained from these measurements is
actually a time series:

\[ (\mathbf{v}_{t_1}, w_{t_1}),\ (\mathbf{v}_{t_2}, w_{t_2}),\ \ldots,\ (\mathbf{v}_{t_n}, w_{t_n}),\ \ldots \]

whose terms represent the instance/outcome couples at successive time instants
$t_1, t_2, \ldots, t_n, \ldots$ It corresponds to one realization of the system
$S$ in time. This realization would usually (if not always) be the only one that
is available for investigating the system's behavior. The property that allows
us to use the series $(\mathbf{v}_{t_i}, w_{t_i})$ instead of
$(\mathbf{v}_i, w_i)$ is called ergodicity. This condition is quite weak and
will be assumed to hold true for the studied system $S$. An extensive discussion
of this condition can be found in Guergachi [199£].

3. GENERAL DESCRIPTION OF THE FRAMEWORK 

In a certain environment $\mathcal{E}$, a situation $\mathbf{v}$ arises randomly
and a transformer $T$ acts and assigns to this situation $\mathbf{v}$ a number
$w$ obtained as a result of the realization of a random trial. Formally,
situation $\mathbf{v}$ represents a vector that takes values from an abstract
space $V$ called instance space. It is generated according to a fixed but
unknown probability density function (pdf) $P_{\mathbf{v}}$ defined on $V$. The
number $w$, which is dependent on $\mathbf{v}$, takes values from another space
$W \subset \mathbb{R}$ called outcome space. It is generated according to a
conditional pdf $P_{w|\mathbf{v}}$ defined on $W$, also fixed but unknown. The
mathematical object $(\mathbf{v}, w)$ then arises in the product space
$Z = V \times W$ (called sample space) according to the joint pdf
$P_{(\mathbf{v},w)} = P_{\mathbf{v}} P_{w|\mathbf{v}}$, which characterizes the
probabilistic environment $\mathcal{E}$. In what follows, the couple
$(\mathbf{v}, w)$ is denoted as $z$ (to mean that it takes values from the
sample space $Z$). Using this notation, the joint pdf $P_{(\mathbf{v},w)}$ is
then denoted as $P_z$. The vector $\mathbf{v}$ will be called interchangeably
"situation" or "instance" and the number $w$ "outcome" or "transformer's
response".

If the behavior of transformer $T$ is governed by a dynamic process, this
transformer would usually possess several different operating modes. To each
mode would correspond a different pdf $P_z$ and a different range of variation
of $\mathbf{v}$ and $w$. To illustrate what is meant by "operating mode" here,
consider for instance the behavior of an automotive engine: the operating
conditions of such an engine are not the same when the car is climbing a hill
and when it is traveling on a highway. In the first case, the engine develops a
very high torque and the speed is low, while in the second case, the same engine
operates under opposite conditions: the speed is high but the torque is low.
Another example that illustrates this concept of "operating mode" is a
wastewater treatment plant using the activated sludge process: the operation of
this plant can use little return of sludge and low solids in the aeration tank
in order to achieve the objective of removing soluble substrate with relatively
low oxygen supply. But this plant could also be operated with the purpose of
aerobically destroying all of the organic solids in the waste, which can be done
by returning all the sludge to the aeration tank. Thus, the same plant could
operate under different operating conditions. In what follows, the operating
mode of the transformer $T$ will be denoted by $OM$.

Associated with the environment $\mathcal{E} = (T, OM, z, P_z)$ is a learning
machine $\mathcal{LM}$ whose objective is to understand the behavior of the
transformer $T$. It receives a finite sequence $\Upsilon_N$ of $N$ training
examples:

\[ \Upsilon_N : \ (\mathbf{v}_1, w_1),\ (\mathbf{v}_2, w_2),\ \ldots,\ (\mathbf{v}_N, w_N) \]

or, using the z-notation:

\[ \Upsilon_N : \ z_1,\ z_2,\ \ldots,\ z_N \]

generated and measured in the probabilistic environment $\mathcal{E}$ as a
result of one realization of this same environment. Based on these training
examples, the learning machine $\mathcal{LM}$ selects a strategy that specifies
the best approximation $w^{\mathcal{LM}}$ of the transformer's response for each
instance $\mathbf{v}$. Once this strategy is selected, it will be used on all
future situations $\mathbf{v}$ arising in the environment $\mathcal{E}$, in
order to predict the transformer's responses. This strategy, which is
mathematically a mapping function from $V$ into $W$, is called a decision rule
and is chosen from a fixed functional space $\mathcal{H}$ called decision rule
space.

The goal of $\mathcal{LM}$ is then to select, from the space $\mathcal{H}$, that
particular decision rule which best approximates the transformer's response. The
expression "best approximation of the transformer's response" means "closeness
to the transformer's 'general tendency' $g_T$". The latter function is defined
as follows:

\[ g_T(\mathbf{v}) = E(w \mid \mathbf{v}) = \int_W w\, P_{w|\mathbf{v}}(w \mid \mathbf{v})\, dw \tag{15} \]

This function will be called interchangeably 'general tendency' or 'response
function'. Closeness is understood in the sense of the metric $\mathcal{D}$
defined in the following way:

\[ \forall h \in \mathcal{H}, \quad \mathcal{D}(h, g_T) = \sqrt{E\bigl( l(h(\mathbf{v}), g_T(\mathbf{v})) \bigr)} = \sqrt{\int_V l(h(\mathbf{v}), g_T(\mathbf{v}))\, P_{\mathbf{v}}(\mathbf{v})\, d\mathbf{v}} \tag{16} \]

where $l$ is defined throughout this paper as the quadratic loss:

\[ \forall (a, b) \in \mathbb{R}^2, \quad l(a, b) = |a - b|^2 \]

After receiving the sequence $\Upsilon_N$ of training examples, the learning
machine $\mathcal{LM}$ selects that particular decision rule $h_0$ that
minimizes $\mathcal{D}(h, g_T)$ on the space $\mathcal{H}$ ($h$ designates an
element of $\mathcal{H}$ and $g_T$ the transformer's "general tendency").
Formally, this means finding the minimum of the function:

\[ \mathcal{D}(\,\cdot\,, g_T) : \mathcal{H} \to \mathbb{R}_+, \qquad h \mapsto \mathcal{D}(h, g_T) \]

and the decision rule $h_0$ at which this minimum is attained. To do so,
$\mathcal{LM}$ implements an algorithm $\mathcal{A}$ whose ultimate goal is to
find $h_0$ on the basis of the finite sequence $\Upsilon_N$ of training
examples.

Note that $w$ is related to $g_T(\mathbf{v})$ through the following
relationship:

\[ w = g_T(\mathbf{v}) + e \tag{17} \]

where $e$ is the noise associated with the probabilistic environment
$\mathcal{E}$. By the properties of conditional expectation, it follows from
(17), together with definition (15), that:

\[ E(e \mid \mathbf{v}) = 0 \tag{18} \]

Remark: The decision rule space $\mathcal{H}$ is considered to be indexed by a
subset of $\mathbb{R}^n$ for some $n \geq 1$, that is, there exist an integer
$n \geq 1$ and a subset $\Gamma \subset \mathbb{R}^n$, such that the space
$\mathcal{H}$ can be expressed as follows:
$\mathcal{H} = \{ h_{\mathbf{p}} \mid \mathbf{p} \in \Gamma \}$. This is the case
for most engineering systems.

4. OVERCOMING THE FIRST OBSTACLE IN MINIMIZING THE VALUE OF $\mathcal{D}$
OVER THE SPACE $\mathcal{H}$

The objective of the learning machine $\mathcal{LM} = (\mathcal{H}, \mathcal{A})$
is to minimize the distance $\mathcal{D}(h, g_T)$ over the whole decision rule
space $\mathcal{H}$. This distance involves two functions: $h$ and $g_T$. The
function $h$ is an element of the space $\mathcal{H}$ and, as such, it is well
known to $\mathcal{LM}$: once the components of $\mathbf{v}$ are measured, the
value of $h(\mathbf{v})$ is readily computable. The problem, however, is $g_T$.
Not only is it an unknown function that is impossible to derive from first
principles (recall that the systems we are dealing with are complex ones), but
there is also no operational way of getting even sample measurements or any
empirical information about it. $g_T$ is indeed buried in noise. What we can
measure, with respect to the transformer's response, is the outcome $w$, and $w$
contains both the value of $g_T$ and noise, all mixed up.

So how should $\mathcal{LM}$ proceed to minimize $\mathcal{D}(h, g_T)$, when the
only information it can get is in the form of noise-corrupted measurements of
the outcome $w$ and, of course, the instance $\mathbf{v}$? Theorem 1 will be of
great help. Before stating it, we need the following definition:

Definition 1 (Expected Risk). Let $\mathcal{E} = (T, OM, z, P_z)$ be a
probabilistic environment and, associated with it, a learning machine
$\mathcal{LM} = (\mathcal{H}, \mathcal{A})$. Let $h \in \mathcal{H}$ be a
decision rule. The expected risk $R(h)$ of $h$ is defined as the expected value
of the random variable:

\[ l(h(\mathbf{v}), w) = |h(\mathbf{v}) - w|^2 \]

when the vector $z = (\mathbf{v}, w)$ is drawn at random in the sample space
$Z = V \times W$ according to the pdf $P_z = P_{(\mathbf{v},w)}$ corresponding to
environment $\mathcal{E}$. Formally, it is:

\[ R(h) = E\bigl( l(h(\mathbf{v}), w) \bigr) = \int_{V \times W} l(h(\mathbf{v}), w)\, P_{(\mathbf{v},w)}(\mathbf{v}, w)\, d\mathbf{v}\, dw \tag{19} \]



Also, to simplify the notation, we need the following definition:

Definition 2 (Simplifying Notations). Let $\mathcal{E} = (T, OM, z, P_z)$ be a
probabilistic environment and, associated with it, a learning machine
$\mathcal{LM} = (\mathcal{H}, \mathcal{A})$. For every decision rule
$h \in \mathcal{H}$, we define the real-valued function $l_h$ on the sample
space $Z = V \times W$ as follows:

\[ \forall (\mathbf{v}, w) \in V \times W, \quad l_h(\mathbf{v}, w) = l(h(\mathbf{v}), w) \tag{20} \]

Hence, using the z-notation, equations (20) and (19) become:

\[ \forall z = (\mathbf{v}, w) \in Z, \quad l_h(z) = l(h(\mathbf{v}), w) \tag{21} \]

\[ \forall h \in \mathcal{H}, \quad R(h) = E\bigl( l_h(z) \bigr) = \int_Z l_h(z)\, P_z(z)\, dz \tag{22} \]



Theorem 1 (Transition $\mathcal{D}(h, g_T) \rightarrow R(h)$). Let
$\mathcal{E} = (T, OM, z, P_z)$ be a probabilistic environment and, associated
with it, a learning machine $\mathcal{LM} = (\mathcal{H}, \mathcal{A})$. Let
$h_0 \in \mathcal{H}$ be a fixed decision rule. Then the function:

\[ h \mapsto \mathcal{D}(h, g_T) \]

is minimal at $h_0$ if and only if the function:

\[ h \mapsto R(h) \]

is minimal at $h_0$.

Proof. Using equation (15), it can be shown that the equality:

\[ R(h) = \int_{V \times W} \bigl[ w - g_T(\mathbf{v}) \bigr]^2 P_{(\mathbf{v},w)}(\mathbf{v}, w)\, d\mathbf{v}\, dw + \bigl[ \mathcal{D}(h, g_T) \bigr]^2 \tag{23} \]

holds true for all $h \in \mathcal{H}$. Since the integral
$\int_{V \times W} [w - g_T(\mathbf{v})]^2 P_{(\mathbf{v},w)}(\mathbf{v}, w)\, d\mathbf{v}\, dw$
is independent of $h$, it follows that $\mathcal{D}(h, g_T)$ is minimal if and
only if $R(h)$ is minimal, and that both functions attain their minimum at the
same function $h_0$. $\square$
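
For readers who want the intermediate step behind the decomposition of $R(h)$ into a noise term plus $[\mathcal{D}(h, g_T)]^2$, the following derivation is a sketch that relies only on the quadratic loss, definition (15) and the factorization $P_{(\mathbf{v},w)} = P_{\mathbf{v}} P_{w|\mathbf{v}}$ (integrability of all terms is assumed):

```latex
\begin{align*}
R(h) &= \int_{V \times W} \bigl[ (h(\mathbf{v}) - g_T(\mathbf{v})) - (w - g_T(\mathbf{v})) \bigr]^2
        P_{(\mathbf{v},w)}(\mathbf{v}, w)\, d\mathbf{v}\, dw \\
     &= \int_{V} \bigl[ h(\mathbf{v}) - g_T(\mathbf{v}) \bigr]^2 P_{\mathbf{v}}(\mathbf{v})\, d\mathbf{v}
      + \int_{V \times W} \bigl[ w - g_T(\mathbf{v}) \bigr]^2 P_{(\mathbf{v},w)}(\mathbf{v}, w)\, d\mathbf{v}\, dw \\
     &\quad - 2 \int_{V} \bigl[ h(\mathbf{v}) - g_T(\mathbf{v}) \bigr]
        \Bigl( \int_{W} \bigl[ w - g_T(\mathbf{v}) \bigr] P_{w|\mathbf{v}}(w \mid \mathbf{v})\, dw \Bigr)
        P_{\mathbf{v}}(\mathbf{v})\, d\mathbf{v}.
\end{align*}
% The inner integral equals E(w | v) - g_T(v) = 0 by (15), so the cross term vanishes.
% The first term is [D(h, g_T)]^2 by (16).
```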

Theorem 1 is very important in simplifying the learning problem that
$\mathcal{LM}$ is faced with. What it means is that minimizing
$\mathcal{D}(h, g_T)$ or, equivalently, its square $[\mathcal{D}(h, g_T)]^2$ over
$\mathcal{H}$ amounts to minimizing $R(h)$ over the decision rule space. Look at
the expressions of these two functions $[\mathcal{D}(h, g_T)]^2$ and $R(h)$:

\[ \bigl[ \mathcal{D}(h, g_T) \bigr]^2 = E\bigl( l(h(\mathbf{v}), g_T(\mathbf{v})) \bigr) \tag{24} \]

and

\[ R(h) = E\bigl( l(h(\mathbf{v}), w) \bigr) \tag{25} \]

From these expressions, it can be seen that, in the course of minimizing
$\mathcal{D}(h, g_T)$, Theorem 1 allows us to replace the unknown and
non-measurable noise-free value $g_T(\mathbf{v})$ by the measurable
noise-corrupted value $w$, without losing information on the decision rule $h_0$
at which the minimum of $\mathcal{D}(h, g_T)$ is attained.
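
The relationship between the expected risk and the squared distance to the general tendency can also be checked numerically. The sketch below is a toy Monte Carlo experiment under assumed ingredients: a known general tendency g_T(v) = sin(v), Gaussian noise of standard deviation 0.3, and an arbitrary candidate rule h(v) = 0.8 v. In a real application g_T would of course not be available.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

def g_T(v):
    """Assumed 'general tendency' E(w | v); unknown in practice."""
    return np.sin(v)

def h(v):
    """An arbitrary candidate decision rule."""
    return 0.8 * v

sigma = 0.3
v = rng.uniform(-1.0, 1.0, size=n)
w = g_T(v) + sigma * rng.normal(size=n)     # outcome = general tendency + noise

risk = np.mean((h(v) - w) ** 2)             # Monte Carlo estimate of R(h)
dist_sq = np.mean((h(v) - g_T(v)) ** 2)     # estimate of [D(h, g_T)]^2
noise = np.mean((w - g_T(v)) ** 2)          # estimate of the h-independent noise term
gap = abs(risk - (noise + dist_sq))         # should be close to zero
```

Because the noise term does not depend on h, ranking candidate rules by `risk` gives the same ordering as ranking them by `dist_sq`, up to sampling error.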

The following theorem will be helpful for system uncertainty model development:

Theorem 2 (First Inequality). Let $\mathcal{E} = (T, OM, z, P_z)$ be a
probabilistic environment and, associated with it, a learning machine
$\mathcal{LM} = (\mathcal{H}, \mathcal{A})$. Then the inequality:

\[ \bigl[ \mathcal{D}(h, g_T) \bigr]^2 \leq R(h) \tag{26} \]

holds true for any rule $h \in \mathcal{H}$.

Proof. This inequality is a direct consequence of equality (23), since the
integral in (23) is non-negative. $\square$



5. SECOND OBSTACLE: $P_z$ IS NOT KNOWN TO $\mathcal{LM}$

Theorem 1 is still not enough for $\mathcal{LM}$ to proceed to the determination
of the rule $h_0$ that minimizes $\mathcal{D}(h, g_T)$. This is because $R(h)$
is a function of the pdf $P_z$: this pdf embodies all sources of uncertainty in
the environment $\mathcal{E}$ and, as such, it is not known. The objective (and
the power) of the framework developed here consists in avoiding any strong a
priori assumption regarding the sources of uncertainty in $\mathcal{E}$.
Consequently, in what follows, $P_z$ is considered fixed but unknown.



Now, having taken this stand on $P_z$, we have to find a way of minimizing
$R(h)$ on the basis of only a finite number $N$ of training examples
$z_1, z_2, \ldots, z_N$. How can this be done? By introducing a principle called
the Inductive Principle of Empirical Risk Minimization (IPERM). This principle
emerged in the mid-eighties as a result of extensive research work by Vapnik
[1982], Vapnik [1995], Vapnik [1998] and that of Vapnik and Chervonenkis [1968],
Vapnik and Chervonenkis [1981], Vapnik and Chervonenkis [1991].



6. INDUCTIVE PRINCIPLE OF EMPIRICAL RISK MINIMIZATION 

Before we state the IPERM, we need to define the meaning of empirical risk of a
decision rule:

Definition 3 (Empirical Risk). Let $\mathcal{E} = (T, OM, z, P_z)$ be a
probabilistic environment and, associated with it, a learning machine
$\mathcal{LM} = (\mathcal{H}, \mathcal{A})$. Let $h \in \mathcal{H}$ be a
decision rule and $\Upsilon_N = (z_1, z_2, \ldots, z_N)$ a finite sequence of
$N$ training examples generated and measured in the probabilistic environment
$\mathcal{E}$ as a result of one realization of this same environment. The
empirical risk $R_{\mathrm{emp}}^{\Upsilon_N}(h)$ of $h$ on the sequence
$\Upsilon_N$ is defined as the arithmetic mean of the sequence of numbers:

\[ \bigl( l_h(z_i) \bigr)_{i = 1, 2, \ldots, N} \]

that is:

\[ R_{\mathrm{emp}}^{\Upsilon_N}(h) = \frac{1}{N} \sum_{i=1}^{N} l_h(z_i) \tag{27} \]



Having introduced the concept of empirical risk, we can now define what is meant
by an uncertainty model:

Definition 4 (Uncertainty Model). Let $\mathcal{E} = (T, OM, z, P_z)$ be a
probabilistic environment and, associated with it, a learning machine
$\mathcal{LM} = (\mathcal{H}, \mathcal{A})$. Let $\Upsilon_N$ be a finite
sequence of $N$ training examples from the environment $\mathcal{E}$ and $\eta$
a fixed real number in the interval $]0, 1[$. Let
$h_{\mathrm{emp}}^{\Upsilon_N}$ be a decision rule at which the empirical risk
$R_{\mathrm{emp}}^{\Upsilon_N}(h)$ reaches its minimum. An $\eta$-uncertainty
model (or simply uncertainty model) of the transformer $T$ is any inequality of
the type:

\[ \mathcal{D}\bigl( h_{\mathrm{emp}}^{\Upsilon_N}, g_T \bigr) \leq \varphi(e_1, e_2, \ldots, e_l) \tag{28} \]

that satisfies the following two conditions:

- inequality (28) holds true with a probability of at least $1 - \eta$.

- $e_1, e_2, \ldots, e_l$ are a set of uncertainty control variables and
$\varphi$ is a real-valued function of these variables that satisfy the
following:

\[ \text{the variables } e_1, e_2, \ldots, e_l \text{ and the function } \varphi \text{ are readily determinable/computable} \tag{29} \]


Expected and empirical risks, $R(h)$ and $R_{\mathrm{emp}}^{\Upsilon_N}(h)$, may
seem to introduce new concepts in this framework, but they do not if we go back
to the concepts of probability theory. To see that, fix a decision rule $h$ in
the space $\mathcal{H}$. Since $z$ is a random variable, the number $l_h(z)$ is
then also a random variable. Denote it as $\xi$, that is:

\[ \xi = l_h(z) \]

(recall that $h$ is fixed). From probability theory, we know that there are two
measures of the central tendency of a random variable such as $\xi$:

- an empirical measure: given a series of realizations
$\xi_1, \xi_2, \ldots, \xi_N$ of the variable $\xi$, this measure is constructed
by computing the arithmetic average $\bigl(\sum_{i=1}^{N} \xi_i\bigr)/N$ of this
series.

- a mathematical measure: this measure is expressed in terms of the pdf
$P_\xi$ of $\xi$, that is: $\int \xi\, P_\xi(\xi)\, d\xi$. It is called expected
value.

In this framework, $R_{\mathrm{emp}}^{\Upsilon_N}(h)$ represents the empirical
measure of the central tendency of $\xi = l_h(z)$ and $R(h)$ represents the
mathematical one. The former measure is approximate but computable, the latter
is exact but unknown. Also, note that, under some conditions with respect to the
dependency and heterogeneity of the realizations $\xi_i$, the empirical measure
converges to the mathematical one when $N$ is made infinitely large [White
1984]. This is known as the Law of Large Numbers in probability theory. Applying
this law to the case of the expected and empirical risks, we get that
$R_{\mathrm{emp}}^{\Upsilon_N}(h)$ converges (in probability) to $R(h)$ as $N$
is made infinitely large. That is:

\[ R_{\mathrm{emp}}^{\Upsilon_N}(h) \longrightarrow R(h) \quad \text{as } N \to \infty \tag{30} \]

The reader should note a very important fact here: the convergence (30) is valid
for a fixed decision rule $h$ in the space $\mathcal{H}$. This is called
pointwise convergence, as opposed to another type of convergence (called uniform
convergence) that is discussed briefly in the next sections. The term
"pointwise" refers to the fact that the convergence (30) occurs only for fixed
points of $\mathcal{H}$ and not for all points of this space simultaneously.
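
Pointwise convergence can be illustrated with a small simulation. The sketch below assumes a toy environment (v uniform on [0, 1], w = v plus Gaussian noise of standard deviation 0.1, independent examples) and a decision rule h(v) = 0.5 v fixed before any data are drawn; for this setting the expected risk is available in closed form, which is never the case in practice.

```python
import numpy as np

rng = np.random.default_rng(3)

def h(v):
    """A fixed decision rule, chosen before seeing any data."""
    return 0.5 * v

def sample(n):
    """Draw n independent examples z = (v, w) from the assumed environment."""
    v = rng.uniform(0.0, 1.0, size=n)
    w = v + 0.1 * rng.normal(size=n)
    return v, w

def empirical_risk(v, w):
    """Arithmetic mean of the quadratic losses l_h(z_i), as in equation (27)."""
    return float(np.mean((h(v) - w) ** 2))

# Expected risk for this toy environment:
# E[(0.5 v - v - e)^2] = 0.25 * E[v^2] + Var(e) = 0.25 / 3 + 0.01
expected_risk = 0.25 / 3 + 0.01

# Absolute deviation |R_emp(h) - R(h)| for increasing sample sizes N.
errors = {n: abs(empirical_risk(*sample(n)) - expected_risk)
          for n in (10, 1_000, 100_000)}
```

The deviation shrinks as N grows, but only for this one rule h; nothing is claimed here about all rules of the space simultaneously.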

Now, let us state the IPERM. This principle consists in implementing the
following two actions:

- action 1: replace the expected risk $R(h)$ by the empirical risk
$R_{\mathrm{emp}}^{\Upsilon_N}(h)$ computed on the basis of one training
sequence $\Upsilon_N$;

- action 2: take the decision rule $h_{\mathrm{emp}}^{\Upsilon_N}$ at which
$R_{\mathrm{emp}}^{\Upsilon_N}(h)$ reaches its minimum as a good representation
of the best rule $h_0$ that minimizes the expected risk $R(h)$.

Therefore, the implementation of the IPERM comes down to minimizing the
empirical risk $R_{\mathrm{emp}}^{\Upsilon_N}(h)$, instead of the expected one
$R(h)$, over the space $\mathcal{H}$ and then choosing the decision rule
$h_{\mathrm{emp}}^{\Upsilon_N}$ at which the minimum of
$R_{\mathrm{emp}}^{\Upsilon_N}(h)$ is reached to describe the transformer's
behavior. Engineering systems modelers (in various areas of engineering such as
chemical, civil or environmental) have been using this procedure for system
model identification for years. The reader may then wonder why we are developing
a new mathematical framework, if all we are going to do is to turn back to the
traditional model identification procedure. What is the point?

This framework is not about inventing new procedures, but about rationalizing
existing ones and modeling the uncertainty that is associated with them.
Engineering systems modelers have been using the traditional identification
procedure without being aware of the transitions:

\[ \mathcal{D}(h, g_T) \longrightarrow R(h) \longrightarrow R_{\mathrm{emp}}^{\Upsilon_N}(h) \tag{31} \]

Their decision to rely on empirical risk minimization may be explained by the
fact that mechanistic models (as opposed to black-box ones) are usually assumed
to contain adequate a priori information about the real system and, as a result,
very little information would be lost in the transition:

\[ R(h) \longrightarrow R_{\mathrm{emp}}^{\Upsilon_N}(h) \tag{32} \]

Now we know that this is not true for a complex system, since all existing
models represent just a simplified picture of the real system behavior. If the
sequence $\Upsilon_N$ is a finite one, then there is definitely a loss of
information in the transition (32), which has always been ignored by engineering
systems modelers. The aim of this framework is to rationalize and investigate
the validity of this transition. First, we determine in what cases the
replacement of $R(h)$ by $R_{\mathrm{emp}}^{\Upsilon_N}(h)$ can be legitimized
and, second, we evaluate the loss of information that occurs in the course of
this replacement. To do so, we need to examine the applicability of the IPERM,
for which Vapnik's results will be of great help.

7. APPLICABILITY OF THE IPERM

In the transition:

\[ \mathcal{D}(h, g_T) \longrightarrow R(h) \tag{33} \]

there is absolutely no information loss, by virtue of Theorem 1. As a result,
$R(h)$ can be considered as an exact measure of the performance of the decision
rule $h$ when this rule is selected by $\mathcal{LM}$ as an approximation of
$g_T$. The transition that is problematic is the second one:

\[ R(h) \longrightarrow R_{\mathrm{emp}}^{\Upsilon_N}(h) \]

$R_{\mathrm{emp}}^{\Upsilon_N}(h)$ is indeed just an estimate of $R(h)$. Of
course, one may argue that replacing $R(h)$ by
$R_{\mathrm{emp}}^{\Upsilon_N}(h)$, as suggested in action 1 of the IPERM, can
be legitimized by the fact that, according to the Law of Large Numbers,
$R_{\mathrm{emp}}^{\Upsilon_N}(h)$ becomes a perfect estimate of $R(h)$ when the
size $N$ of the sequence $\Upsilon_N$ is made infinitely large. But this fact
cannot be used to justify action 2 of the IPERM. Here is indeed the problem:

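As was done above, denote the decision rules that minimize $R(h)$ and
$R_{\mathrm{emp}}^{\Upsilon_N}(h)$ as $h_0$ and $h_{\mathrm{emp}}^{\Upsilon_N}$,
respectively. This is equivalent to writing that:

\[ R_{\mathrm{emp}}^{\Upsilon_N}\bigl( h_{\mathrm{emp}}^{\Upsilon_N} \bigr) = \inf_{h \in \mathcal{H}} R_{\mathrm{emp}}^{\Upsilon_N}(h) \tag{34} \]

and

\[ R(h_0) = \inf_{h \in \mathcal{H}} R(h) \tag{35} \]

Action 2 of the IPERM stipulates taking $h_{\mathrm{emp}}^{\Upsilon_N}$ as a
good representation of the best rule $h_0$. For this to be justified, we need to
ensure that $h_{\mathrm{emp}}^{\Upsilon_N}$ is very "close" to minimizing the
expected risk $R(h)$ which is, as pointed out previously, an exact measure of
the rule's performance (meaning the rule's closeness to $g_T$ in the sense of
$\mathcal{D}$). In more concrete terms, we need the value
$R(h_{\mathrm{emp}}^{\Upsilon_N})$ of the expected risk at
$h_{\mathrm{emp}}^{\Upsilon_N}$ to be close to the minimum one $R(h_0)$, for $N$
sufficiently large. That is:

\[ R\bigl( h_{\mathrm{emp}}^{\Upsilon_N} \bigr) \longrightarrow R(h_0) \quad \text{as } N \to \infty \tag{36} \]

(convergence is understood in probability)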



It has been shown [Vapnik and Chervonenkis 1991] that the pointwise convergence
(30) does not guarantee the one that is really required for the purpose of the
IPERM, i.e., convergence (36). In other words, it is possible that convergence
(30) be satisfied, but $R(h_{\mathrm{emp}}^{\Upsilon_N})$ always remains far
from $R(h_0)$, even for large values of $N$, meaning that
$h_{\mathrm{emp}}^{\Upsilon_N}$ would never constitute a good approximation to
the transformer's behavior. It is therefore important to verify whether the
IPERM is applicable or not before using it in any learning problem.
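
A classical way to see how empirical risk minimization can fail is a rule space rich enough to contain "memorizing" rules. The sketch below is an illustration constructed for this purpose, not an example taken from the cited works: it assumes an environment with g_T = 0 (the outcome is pure noise of standard deviation 0.1) and a rule that reproduces the training outcomes exactly and answers a poor default value (1.0) everywhere else.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 0.1

def sample(n):
    """Assumed environment: v uniform on [0, 1], w = 0 + noise (so g_T = 0)."""
    v = rng.uniform(0.0, 1.0, size=n)
    w = sigma * rng.normal(size=n)
    return v, w

def make_memorizer(v_train, w_train, default=1.0):
    """Rule that returns the stored outcome for training instances and `default` elsewhere."""
    table = dict(zip(v_train.tolist(), w_train.tolist()))
    def rule(v):
        return np.array([table.get(x, default) for x in np.asarray(v).tolist()])
    return rule

def risk(rule, v, w):
    """Mean quadratic loss of `rule` on the examples (v, w)."""
    return float(np.mean((rule(v) - w) ** 2))

results = {}
for n in (100, 10_000):
    v_tr, w_tr = sample(n)
    memorizer = make_memorizer(v_tr, w_tr)
    v_new, w_new = sample(20_000)        # fresh data approximate the expected risk
    results[n] = (risk(memorizer, v_tr, w_tr), risk(memorizer, v_new, w_new))
```

The memorizing rule minimizes the empirical risk (it is exactly zero) for every N, yet its expected risk stays near 1 + sigma^2, far from the value sigma^2 attained by the rule h = 0, no matter how large N becomes.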

Taking into consideration the foregoing comments, the following definition shall
be adopted for the meaning of the applicability of the IPERM:



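Definition 5 (Applicability of the IPERM). Let $\mathcal{E} = (T, OM, z, P_z)$
be a probabilistic environment and, associated with it, a learning machine
$\mathcal{LM} = (\mathcal{H}, \mathcal{A})$. Let $\Upsilon_N$ be a finite
sequence of $N$ training examples from the environment $\mathcal{E}$ and let
$h_{\mathrm{emp}}^{\Upsilon_N}$ and $h_0$ be two decision rules that minimize
the risks $R_{\mathrm{emp}}^{\Upsilon_N}(h)$ and $R(h)$, respectively (refer to
equations (34) and (35)). The IPERM is said to be applicable to
$(\mathcal{E}, \mathcal{LM})$ if, for any $\varepsilon > 0$, the following
equality holds true:

\[ \lim_{N \to \infty} \Pr\Bigl( \sup_{h \in \mathcal{H}} \delta\bigl[ R(h), R_{\mathrm{emp}}^{\Upsilon_N}(h) \bigr] > \varepsilon \Bigr) = 0 \tag{37} \]

$\delta$ being a deviation measure defined on the real line.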



Now that the applicability of the IPERM has been defined, we need to develop a
simple method of verifying it. In the foregoing discussion, it has been pointed
out that the pointwise convergence (30) is not enough to guarantee the
applicability of the IPERM. A more stringent condition regarding the empirical
risk convergence needs to be imposed. Vapnik and Chervonenkis [1991] have shown
that for the IPERM to be applicable, it is necessary and sufficient that the
empirical risk $R_{\mathrm{emp}}^{\Upsilon_N}(h)$ converges uniformly to the
expected risk $R(h)$ over the whole space $\mathcal{H}$ (convergence is
understood in probability). Mathematically, uniform convergence means that
equation (37) holds true. Intuitively, it means that, as $N$ is made infinitely
large, the whole curve of $R_{\mathrm{emp}}^{\Upsilon_N}(h)$ converges to that
of $R(h)$ over the space $\mathcal{H}$. In this presentation, the theoretical
part of such questions will not be detailed. Instead, the reader is referred to
Vapnik's book "Statistical Learning Theory" [1998] for the details. In what
follows, Vapnik's results are presented in a more practical fashion, allowing
direct application to the cases under study in this paper (i.e., engineering
systems). The mathematical rigor is, however, preserved throughout the whole
presentation.

A criterion to verify the applicability of the IPERM is not the only thing that is needed here. We also want to know how much information is lost when $R(h)$ is replaced by $R_{emp}^{\Upsilon_N}(h)$. Here again, to evaluate this information loss, we need to define a measure of the deviation between $R(h)$ and $R_{emp}^{\Upsilon_N}(h)$. For this purpose, two relative deviation measures are introduced:

- relative measure $\delta_1$ defined by:

$$\forall (a_1, a_2) \in \mathbb{R}^2, \quad \delta_1[a_1, a_2] = \frac{a_1 - a_2}{\sqrt{a_1}} \qquad (38)$$

- relative measure $\delta_2$ defined by:

$$\forall (a_1, a_2) \in \mathbb{R}^2, \quad \delta_2[a_1, a_2] = \frac{a_1 - a_2}{a_1} \qquad (39)$$

Each one of these two measures will be associated with a different weak prior information about $(\mathcal{E}, \mathcal{LM})$.
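
The two measures are simple enough to state as code. The following Python sketch is illustrative only: it assumes that $\delta_1$ normalizes the gap $a_1 - a_2$ by $\sqrt{a_1}$ and $\delta_2$ by $a_1$, restricts both to $a_1 > 0$, and uses hypothetical function names and risk values that do not come from the paper.

```python
import math


def delta_1(a1: float, a2: float) -> float:
    """Relative deviation (a1 - a2) / sqrt(a1); assumed form, requires a1 > 0."""
    if a1 <= 0:
        raise ValueError("delta_1 requires a1 > 0")
    return (a1 - a2) / math.sqrt(a1)


def delta_2(a1: float, a2: float) -> float:
    """Relative deviation (a1 - a2) / a1; requires a1 > 0."""
    if a1 <= 0:
        raise ValueError("delta_2 requires a1 > 0")
    return (a1 - a2) / a1


# Hypothetical values: a1 plays the role of the expected risk R(h),
# a2 the role of the empirical risk.
expected_risk = 0.25
empirical_risk = 0.16
d1 = delta_1(expected_risk, empirical_risk)
d2 = delta_2(expected_risk, empirical_risk)
```

Both measures are zero when the two risks coincide and grow as the empirical risk underestimates the expected risk; they differ only in how strongly the gap is rescaled for small risks.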

Using these measures, the following theorem defines sufficient conditions for the applicability of the IPERM and helps evaluate the loss of information that occurs when $R(h)$ is replaced by $R_{emp}^{\Upsilon_N}(h)$:

Theorem 3 (Applicability of the IPERM). Let $\mathcal{E} = (\mathcal{T}, OM, z, P_z)$ be a probabilistic environment and, associated with it, a learning machine $\mathcal{LM} = (\mathcal{H}, \mathcal{A})$. Let $\Upsilon_N$ be a finite sequence of $N$ training examples from the environment $\mathcal{E}$ and $\eta$ a real number in the interval $]0, 1[$. Let $\delta$ be one of the deviation measures $\delta_1$ or $\delta_2$. If it is possible to establish some Weak Prior Information $\mathcal{WPI}$ about $(\mathcal{E}, \mathcal{LM})$ and construct a function $C$ dependent on $N$, the whole set $\mathcal{H}$, $\mathcal{WPI}$ and the number $\eta$ such that both statements 1 and 2 listed below hold true, then the IPERM is applicable to $(\mathcal{E}, \mathcal{LM})$. When such a function:

$$C = C(N, \mathcal{H}, \mathcal{WPI}, \eta)$$

exists, the IPERM is said to be $\delta$-applicable to $(\mathcal{E}, \mathcal{LM})$ with the bound $C(N, \mathcal{H}, \mathcal{WPI}, \eta)$.

- Statement 1: for any $\eta \in\, ]0, 1[$, the inequality:

$$\sup_{h \in \mathcal{H}} \delta\big[R(h), R_{emp}^{\Upsilon_N}(h)\big] \le C(N, \mathcal{H}, \mathcal{WPI}, \eta)$$

is satisfied with probability of at least $1 - \eta$.
- Statement 2: when $\mathcal{H}$, $\eta$ and $\mathcal{WPI}$ are fixed, then:

$$\lim_{N \to \infty} C(N, \mathcal{H}, \mathcal{WPI}, \eta) = 0$$

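Proof. Let $\varepsilon > 0$ and $\eta \in\, ]0, 1[$ be two fixed numbers. From statement 2, we infer that:

$$\exists N_0 \in \mathbb{N}, \ \forall N > N_0, \quad C(N, \mathcal{H}, \mathcal{WPI}, \eta) < \varepsilon$$

Then, from statement 1, we get that for $N > N_0$, the inequality:

$$\sup_{h \in \mathcal{H}} \delta\big[R(h), R_{emp}^{\Upsilon_N}(h)\big] < \varepsilon$$

is satisfied with probability of at least $1 - \eta$. That is:

$$\Pr\Big( \sup_{h \in \mathcal{H}} \delta\big[R(h), R_{emp}^{\Upsilon_N}(h)\big] > \varepsilon \Big) \le \eta$$

Thus, we have shown that, for any $\varepsilon > 0$:

$$\forall \eta \in\, ]0, 1[, \ \exists N_0 \in \mathbb{N}, \ \forall N > N_0, \quad \Pr\Big( \sup_{h \in \mathcal{H}} \delta\big[R(h), R_{emp}^{\Upsilon_N}(h)\big] > \varepsilon \Big) \le \eta$$

which means, by definition, that:

$$\lim_{N \to \infty} \Pr\Big( \sup_{h \in \mathcal{H}} \delta\big[R(h), R_{emp}^{\Upsilon_N}(h)\big] > \varepsilon \Big) = 0 \qquad \Box$$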



Now recall that the objective of this study is to develop uncertainty models (see the corresponding definition) for complex engineering systems. The following theorem defines a way of developing such models:



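Theorem 4 (Uncertainty Model). Let $\mathcal{E} = (\mathcal{T}, OM, z, P_z)$ be a probabilistic environment and, associated with it, a learning machine $\mathcal{LM} = (\mathcal{H}, \mathcal{A})$. Let $\Upsilon_N$ be a finite sequence of $N$ training examples from the environment $\mathcal{E}$ and $\eta$ a real number in the interval $]0, 1[$. Let $\mathcal{WPI}$ be some weak prior information about $(\mathcal{E}, \mathcal{LM})$ and $h_{emp}^{\Upsilon_N}$ a decision rule at which the empirical risk $R_{emp}^{\Upsilon_N}(h)$ reaches its minimum.

- If the IPERM is $\delta_1$-applicable to $(\mathcal{E}, \mathcal{LM})$ with the bound $C = C(N, \mathcal{H}, \mathcal{WPI}, \eta)$, then the inequality:

$$\big[\mathcal{D}(h_{emp}^{\Upsilon_N}, g^{\mathcal{T}})\big]^2 \le R_{emp}^{\Upsilon_N}(h_{emp}^{\Upsilon_N}) + \frac{C^2}{2} \left( 1 + \sqrt{1 + \frac{4 R_{emp}^{\Upsilon_N}(h_{emp}^{\Upsilon_N})}{C^2}} \right) \qquad (40)$$

holds true with probability of at least $1 - \eta$.
- If the IPERM is $\delta_2$-applicable to $(\mathcal{E}, \mathcal{LM})$ with the bound $C = C(N, \mathcal{H}, \mathcal{WPI}, \eta)$, then the inequality:

$$\big[\mathcal{D}(h_{emp}^{\Upsilon_N}, g^{\mathcal{T}})\big]^2 \le \frac{R_{emp}^{\Upsilon_N}(h_{emp}^{\Upsilon_N})}{(1 - C)_+} \qquad (41)$$

holds true with probability of at least $1 - \eta$, where $(a)_+ = \sup(a, 0)$.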

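Proof. If the IPERM is $\delta_1$-applicable to $(\mathcal{E}, \mathcal{LM})$ with the bound $C = C(N, \mathcal{H}, \mathcal{WPI}, \eta)$, then, from theorem 3, it follows that (all inequalities hold with probability of at least $1 - \eta$):

$$\frac{R(h_{emp}^{\Upsilon_N}) - R_{emp}^{\Upsilon_N}(h_{emp}^{\Upsilon_N})}{\sqrt{R(h_{emp}^{\Upsilon_N})}} \le C$$

Hence, solving this inequality for $R(h_{emp}^{\Upsilon_N})$:

$$R(h_{emp}^{\Upsilon_N}) \le R_{emp}^{\Upsilon_N}(h_{emp}^{\Upsilon_N}) + \frac{C^2}{2} \left( 1 + \sqrt{1 + \frac{4 R_{emp}^{\Upsilon_N}(h_{emp}^{\Upsilon_N})}{C^2}} \right)$$

and then, since the squared distance is bounded by the expected risk (earlier theorem), it follows that:

$$\big[\mathcal{D}(h_{emp}^{\Upsilon_N}, g^{\mathcal{T}})\big]^2 \le R_{emp}^{\Upsilon_N}(h_{emp}^{\Upsilon_N}) + \frac{C^2}{2} \left( 1 + \sqrt{1 + \frac{4 R_{emp}^{\Upsilon_N}(h_{emp}^{\Upsilon_N})}{C^2}} \right)$$

Similarly, if the IPERM is $\delta_2$-applicable to $(\mathcal{E}, \mathcal{LM})$ with the bound $C = C(N, \mathcal{H}, \mathcal{WPI}, \eta)$, then, from theorem 3, it follows that:

$$\frac{R(h_{emp}^{\Upsilon_N}) - R_{emp}^{\Upsilon_N}(h_{emp}^{\Upsilon_N})}{R(h_{emp}^{\Upsilon_N})} \le C$$

and then:

$$\big[\mathcal{D}(h_{emp}^{\Upsilon_N}, g^{\mathcal{T}})\big]^2 \le R(h_{emp}^{\Upsilon_N}) \le \frac{R_{emp}^{\Upsilon_N}(h_{emp}^{\Upsilon_N})}{(1 - C)_+} \qquad \Box$$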



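The bound on the squared distance $\big[\mathcal{D}(h_{emp}^{\Upsilon_N}, g^{\mathcal{T}})\big]^2$, when it exists, is called the guaranteed deviation between $h_{emp}^{\Upsilon_N}$ and $g^{\mathcal{T}}$, and is denoted as $\varphi$ or as

$$\varphi\big(N, \mathcal{H}, R_{emp}^{\Upsilon_N}(h_{emp}^{\Upsilon_N}), \mathcal{WPI}, \eta\big)$$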

8. THE VAPNIK-CHERVONENKIS (VC) DIMENSION 

One of the objects on which the guaranteed deviation $\varphi$ depends is the whole set $\mathcal{H}$ of decision rules. We now need to know exactly which characteristic of $\mathcal{H}$ affects $\varphi$ and the resulting uncertainty models. Intuitive analysis of uncertainty in engineering systems shows that this characteristic is the complexity of $\mathcal{H}$ [Guergachi 1999]. The objective of this section is to define a measure of this complexity. This measure is known as the Vapnik-Chervonenkis dimension, or simply VC dimension, named in honor of its originators, Vapnik and Chervonenkis [1968]. The definition of this dimension is quite difficult to assimilate on first reading. Because of this, an intuitive interpretation of the VC dimension is given first and, at the end of this section, a series of illustrative examples is presented.

8.1 Intuitive Introduction 

Consider the following concrete example:

- $V_1 = \mathbb{R}$ and $W_1 = \mathbb{R}$;
- $\mathcal{H} = \mathcal{H}_{line}$ is the set of all functions $h$ from $V_1$ into $W_1$ such that:

$$\forall x \in V_1, \quad h(x) = p_1 x + p_2$$

where $p = (p_1, p_2) \in \mathbb{R}^2$ is the parameter vector.

If we had to assign a number to the complexity of this set of functions, then intuitively the number two, corresponding to the number of parameters, would be the most suitable one. Consider now this second example:

- $V_2 = \mathbb{R}$ and $W_2 = \mathbb{R}$;
- $\mathcal{H} = \mathcal{H}_{sine}$ is the set of all functions $h$ from $V_2$ into $W_2$ such that:

$$\forall x \in V_2, \quad h(x) = p_1 \sin(p_2 x)$$

where $p = (p_1, p_2) \in \mathbb{R}^2$ is the parameter vector.

Since the number of parameters that define this set is also two, we may be tempted to again assign the number two to the complexity of this set. Doing so would mean that $\mathcal{H}_{line}$ and $\mathcal{H}_{sine}$ have the same degree of complexity, which is obviously not correct: the set $\mathcal{H}_{line}$ is a family of just straight lines, while $\mathcal{H}_{sine}$ is a complex family of curves that can take many different shapes. The "expressive power" of $\mathcal{H}_{sine}$ is indeed much higher than that of $\mathcal{H}_{line}$. As a result, it should be expected that the complexity of $\mathcal{H}_{sine}$ be much higher than that of $\mathcal{H}_{line}$, and that is what we get when we consider the VC dimension as a measure of the complexity of the decision rule space.

Intuitively, the VC dimension may be considered as equal to the maximum number of points that the curves representing the functions of the decision rule space can pass through simultaneously. Straight lines (functions defined by $h(x) = p_1 x + p_2$, space $\mathcal{H}_{line}$) can pass through any 2 points, but not any 3 points. Parabolas (functions defined by $h(x) = p_1 x^2 + p_2 x + p_3$, space $\mathcal{H}_{parab}$) can pass through any 3 points, but not any 4 points. Sine functions ($h(x) = p_1 \sin(p_2 x)$, space $\mathcal{H}_{sine}$) can pass through any number of points. Hence, if the VC dimension of a space $\mathcal{H}$ is denoted as $q(\mathcal{H})$, then:

$$q(\mathcal{H}_{line}) = 2$$
$$q(\mathcal{H}_{parab}) = 3$$
$$q(\mathcal{H}_{sine}) = \infty$$

The foregoing intuitive interpretation of the VC dimension is approximate. A more precise definition is given in the next section.

8.2 Definitions 

For every set $I$, the notation $2^I$ will designate the set of all subsets of $I$.

Definition 6 (VC Dimension of a Family of Sets). Let $G$ be some space ($\mathbb{R}^n$ with $n > 0$ for example or any other space). Let $\mathcal{G}$ be a family of subsets of $G$ (examples of $\mathcal{G}$ in the case of $G = \mathbb{R}^2$ are the family of all open (or closed) balls of $\mathbb{R}^2$ or the family of all half planes of $\mathbb{R}^2$) and $I$ a finite subset of $G$. Let $\Pi^{\mathcal{G}}(I)$ be the subset of $2^I$ defined as follows:

$$\Pi^{\mathcal{G}}(I) = \{ A \in 2^I \mid \exists F \in \mathcal{G}, \ A = F \cap I \}$$

The finite set $I$ is said to be shattered by the family of sets $\mathcal{G}$ if $\Pi^{\mathcal{G}}(I) = 2^I$. The largest integer $q$ such that some finite subset $I \subset G$ of size $q$ is shattered by $\mathcal{G}$ is called the Vapnik-Chervonenkis dimension (VC dimension) of the family $\mathcal{G}$. It is denoted by $q = q(\mathcal{G})$. If such an integer $q$ does not exist, then the VC dimension of $\mathcal{G}$ is said to be infinite.
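
Definition 6 can be checked by brute force on small finite examples. The Python sketch below is a hedged illustration, not part of the original development: the helper names `trace` and `is_shattered`, the finite grid of interval endpoints, and the test points are all assumptions made here. It uses closed intervals of the real line, a family that shatters some sets of 2 points but no set of 3 points, because the subset made of the two outer points alone can never be cut out by an interval.

```python
from itertools import product


def trace(points, family):
    """Subsets of `points` obtained as F ∩ points, for every set F in `family`.

    Each set F is represented by a membership predicate x -> bool.
    """
    return {frozenset(x for x in points if member(x)) for member in family}


def is_shattered(points, family):
    """True if every one of the 2**len(points) subsets appears in the trace."""
    return len(trace(points, family)) == 2 ** len(points)


# Finite sub-family of closed intervals [a, b] with endpoints on a coarse grid.
# Pairs with a > b give the empty set, which is needed for the empty subset.
grid = [i / 2 for i in range(-2, 9)]  # -1.0, -0.5, ..., 4.0
intervals = [(lambda x, a=a, b=b: a <= x <= b) for a, b in product(grid, grid)]

two_points_shattered = is_shattered([1.0, 2.0], intervals)
three_points_shattered = is_shattered([1.0, 2.0, 3.0], intervals)
```

A finite grid can only under-represent the full family, so a negative answer from such a search is not a proof in general; for intervals, the negative result for three points agrees with the known value of 2 for the VC dimension of this family.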

Definition 7 (VC Dimension of a Family of Functions). Let $\mathcal{F}$ be a family of real-valued functions on some space $G$ and $I$ a finite subset of $G$. For every function $f \in \mathcal{F}$, define the subset $pos(f)$ of the space $G$ as follows:

$$pos(f) = \{ a \in G \mid f(a) > 0 \}$$

Then define the family $pos(\mathcal{F})$ of subsets of $G$ as follows:

$$pos(\mathcal{F}) = \{ pos(f) \mid f \in \mathcal{F} \}$$

The finite set $I$ is said to be shattered by the family of real-valued functions $\mathcal{F}$ if it is shattered by the family of subsets $pos(\mathcal{F})$. The Vapnik-Chervonenkis dimension (VC dimension) $q(\mathcal{F})$ of the family $\mathcal{F}$ of real-valued functions is, by definition, equal to the Vapnik-Chervonenkis dimension of the family of subsets $pos(\mathcal{F})$:

$$q(\mathcal{F}) = q(pos(\mathcal{F}))$$



The VC dimension is then a purely combinatorial concept that has, a priori, no connection with the geometric notion of dimension. In most situations, it is difficult to evaluate the VC dimension by analytic means. Usually, all that is possible is to determine a bound on the VC dimension, that is, to establish an inequality of the form $q(\mathcal{F}) \le q_0$ ($q_0 \in \mathbb{N}$). In some cases, the VC dimension is simply approximated by the number of free parameters of the family $\mathcal{F}$. The following theorem shows how to determine it in some particular cases. It also establishes a link with the geometric notion of dimension.



Theorem 5 (VC Dimension and Vector Space). Let $\mathcal{F}$ be a family of real-valued functions on some space $G$. Fix any function $f_0$ from $G$ into $\mathbb{R}$ and let $\mathcal{F}_0$ be the new family of functions defined by $\mathcal{F}_0 = f_0 + \mathcal{F} = \{ f_0 + f \mid f \in \mathcal{F} \}$. If $\mathcal{F}$ is an $m$-dimensional real vector space, then the VC dimension $q(\mathcal{F}_0)$ of $\mathcal{F}_0$ is equal to $m$:

$$q(\mathcal{F}_0) = m$$

Proof. Refer to [Wenocur and Dudley 1981] for the proof of this theorem. $\Box$



8.3 Examples 

- Example 1: Consider the family of functions $h_p$ defined from the space $G = \mathbb{R}^n$ ($n \in \mathbb{N}^*$) into $\{0, 1\}$ by:

$$\forall x = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n, \quad h_p(x) = \psi\Big( \sum_{j=1}^{n} p_j x_j \Big)$$

where $p = (p_1, p_2, \ldots, p_n, \theta) \in \mathbb{R}^{n+1}$ is the parameter vector and $\psi$ is defined by (real threshold $\theta$):

$$\psi(a) = \begin{cases} 1 & \text{if } a \ge \theta \\ 0 & \text{if } a < \theta \end{cases}$$

This family of functions is known as the perceptron and is used in pattern recognition. Its VC dimension is equal to $n + 1$ [Anthony and Biggs 1992].

- Example 2: Consider the family of real-valued functions $h_p$ defined on some space $G$ by:

$$\forall x \in G, \quad h_p(x) = \sum_{i=1}^{n} p_i \psi_i(x)$$

where $p = (p_1, p_2, \ldots, p_n) \in \mathbb{R}^n$ is the parameter vector and $\psi_1, \psi_2, \ldots, \psi_n$ is a sequence of $n$ linearly independent real-valued functions. The VC dimension of this family of functions is equal to $n$ [Vapnik 1982]. Note that this value follows directly from theorem 5.
- Example 3: Consider the family of functions $h_p$ defined on $G = \mathbb{R}^2$ by:

$$\forall (x, y) \in \mathbb{R}^2, \quad h_p(x, y) = \big( y - poly_n(x, p) \big)^2$$

where $p = (p_0, p_1, p_2, \ldots, p_n) \in \mathbb{R}^{n+1}$ is the parameter vector and $poly_n(x, p)$ is a polynomial function of degree $n$ defined by:

$$\forall x \in \mathbb{R}, \quad poly_n(x, p) = p_0 + p_1 x + p_2 x^2 + \ldots + p_n x^n$$

The VC dimension of this family of functions $h_p$ is at most $2n + 2$ [Vapnik 1995].
- Example 4: Consider the family of functions $h_p$ defined on $G = \mathbb{R}$ by:

$$\forall x \in \mathbb{R}, \quad h_p(x) = p_1 \sin(p_2 x)$$

where $p = (p_1, p_2) \in \mathbb{R}^2$ is the parameter vector. The VC dimension of this family of functions is infinite [Vapnik 1998].
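
The infinite VC dimension of the sine family can be made concrete numerically. The Python sketch below uses a standard textbook construction, which is an assumption introduced here rather than something taken from this paper: with $p_1 = 1$ and the points $x_i = 10^{-i}$, the frequency $p_2 = \pi \big(1 + \sum_i (1 - y_i) 10^i \big)$ makes $\sin(p_2 x_i)$ positive exactly where the label $y_i$ equals 1. The function names and the choice of four points are hypothetical.

```python
import math
from itertools import product


def sine_frequency(labels):
    """Frequency p2 realising `labels` (1 = positive, 0 = not positive) on x_i = 10**-i."""
    return math.pi * (1 + sum((1 - y) * 10 ** i for i, y in enumerate(labels, start=1)))


def realised_labels(p2, n_points):
    """Labels induced by pos(h) for h(x) = sin(p2 * x) on the points x_i = 10**-i."""
    points = [10.0 ** -i for i in range(1, n_points + 1)]
    return tuple(1 if math.sin(p2 * x) > 0 else 0 for x in points)


n_points = 4
# Check that every one of the 2**n_points labelings is realised by some frequency.
all_realised = all(
    realised_labels(sine_frequency(labels), n_points) == labels
    for labels in product((0, 1), repeat=n_points)
)
```

The same argument works for any number of points, which is why no finite integer can serve as the VC dimension; only floating-point precision limits how many points this particular script can handle.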



From these examples, it can be seen that the VC dimension of a family of functions is not necessarily equal to its number of parameters. It can be larger (example 4), equal (examples 1 and 2) or smaller (see [Vapnik 1995], where new types of learning machines were constructed) than the number of parameters.
9. VC DIMENSION AND APPLICABILITY OF THE IPERM

In section 7, the concept of applicability of the IPERM and that of guaranteed deviation between the decision rule $h_{emp}^{\Upsilon_N}$ that minimizes the empirical risk and the transformer's response function $g^{\mathcal{T}}$ were introduced. However, no methodology has been developed to determine the expression of the function $C = C(N, \mathcal{H}, \mathcal{WPI}, \eta)$ (see theorems 3 and 4), which is the key function in implementing those concepts. In this section, some fundamental results with respect to the determination of such a function are presented. These results make use of the VC dimension concept defined in the previous section and they are due to [Vapnik 1998]. Extensive discussion and application of these results to model identification and quality evaluation can be found in Guergachi [1999].

Before stating these results, we need to define a new space $l_{\mathcal{H}}$ and five different conditions.



Definition 8 (Space $l_{\mathcal{H}}$). Let $\mathcal{E} = (\mathcal{T}, OM, z, P_z)$ be a probabilistic environment and, associated with it, a learning machine $\mathcal{LM} = (\mathcal{H}, \mathcal{A})$. For every decision rule $h \in \mathcal{H}$ and real number $\beta \in \mathbb{R}_+$, we define the real-valued function $l_{h,\beta}$ on the sample space $Z = V \times W$ as follows:

$$\forall z \in Z, \quad l_{h,\beta}(z) = l_h(z) - \beta$$

The functional space of all functions $l_{h,\beta}$ will be denoted by $l_{\mathcal{H}}$:

$$l_{\mathcal{H}} = \{ l_{h,\beta} \mid (h, \beta) \in \mathcal{H} \times \mathbb{R}_+ \}$$



Now let us define the following conditions C.1, C'.1, C.2, C.3 and C'.3:

C.1 Weak Prior Information (1):

There exists a positive number $M \in\, ]0, +\infty[$ such that:

$$\sup_{h \in \mathcal{H},\, z \in Z} l_h(z) = M$$

C'.1 Weak Prior Information (2):

There exists a pair $(s, \tau) \in \mathbb{R}^2$ with $s > 2$ and $\tau < +\infty$ such that:

$$\sup_{h \in \mathcal{H}} \frac{\big( \mathbb{E}\,[l_h(z)]^s \big)^{1/s}}{R(h)} < \tau$$

where the expectation is taken with respect to $P_z$.

C.2 VC Dimension:

The VC dimension $q = q(l_{\mathcal{H}})$ of the functional space $l_{\mathcal{H}}$ is finite.

C.3 i.i.d. condition:

The training examples:

$$z_1, z_2, \ldots, z_N$$

of the sequence $\Upsilon_N$ are independent and identically distributed (i.i.d.).

C'.3 Weaker i.i.d. condition:

The real-valued random variables:

$$l_h(z_1);\ l_h(z_2);\ \ldots;\ l_h(z_N)$$

obtained by computing the values of $l_h$ at each one of the training examples $z_i$ of the sequence $\Upsilon_N$, are independent and identically distributed (i.i.d.) for any $h \in \mathcal{H}$.



Theorem 6 (IPERM applicability and VC (1)). Let $\mathcal{E} = (\mathcal{T}, OM, z, P_z)$ be a probabilistic environment and, associated with it, a learning machine $\mathcal{LM} = (\mathcal{H}, \mathcal{A})$. Let $\Upsilon_N$ be a finite sequence of $N$ training examples from the environment $\mathcal{E}$ and $\eta$ a real number in the interval $]0, 1[$. If the conditions C.1, C.2 and C.3 are satisfied, then the IPERM is $\delta_1$-applicable to $(\mathcal{E}, \mathcal{LM})$ with the bound:

$$C = \sqrt{M \zeta} \qquad (42)$$

where:

- The number $\zeta$ is:

$$\zeta = 4\, \frac{q \left( \ln \frac{2N}{q} + 1 \right) - \ln \frac{\eta}{4}}{N}$$

- $q$ is the VC dimension $q(l_{\mathcal{H}})$ of the space $l_{\mathcal{H}}$.



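Proof. [Vapnik 1998] showed that, for any $\varepsilon > 0$, the following inequality holds true:

$$\Pr\Big( \sup_{h \in \mathcal{H}} \delta_1\big[R(h), R_{emp}^{\Upsilon_N}(h)\big] > \varepsilon \Big) < 4 \exp\left\{ \left[ \frac{q \left( \ln \frac{2N}{q} + 1 \right)}{N} - \frac{\varepsilon^2}{4M} \right] N \right\} \qquad (43)$$

when conditions C.1, C.2 and C.3 are satisfied ([Vapnik 1998], see inequalities 5.24 and 5.12 at pages 197 and 192 respectively). Set the right-hand side of the above inequality equal to $\eta$. Then the expression of $\varepsilon$ is:

$$\varepsilon = \sqrt{M \zeta}$$

and, therefore, from Vapnik's inequality, it follows that the inequality:

$$\sup_{h \in \mathcal{H}} \delta_1\big[R(h), R_{emp}^{\Upsilon_N}(h)\big] \le \sqrt{M \zeta}$$

holds true with probability of at least $1 - \eta$. $\Box$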
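
Statement 2 of the applicability theorem requires the bound to vanish as $N$ grows, and this is easy to observe numerically for the bound $\sqrt{M \zeta}$. The Python sketch below is given under stated assumptions: $\zeta$ is taken to be $4\,[q(\ln(2N/q) + 1) - \ln(\eta/4)]/N$, the form that appears in Vapnik's bounds, and the function names as well as the numerical values ($q = 3$, $\eta = 0.05$, $M = 1$) are hypothetical.

```python
import math


def zeta(n_samples: int, vc_dim: int, eta: float) -> float:
    """Assumed form: zeta = 4 * (q * (ln(2N/q) + 1) - ln(eta/4)) / N."""
    if n_samples <= 0 or vc_dim <= 0:
        raise ValueError("n_samples and vc_dim must be positive")
    if not 0.0 < eta < 1.0:
        raise ValueError("eta must lie strictly between 0 and 1")
    growth = vc_dim * (math.log(2.0 * n_samples / vc_dim) + 1.0)
    return 4.0 * (growth - math.log(eta / 4.0)) / n_samples


def bound_delta_1(n_samples: int, vc_dim: int, eta: float, loss_sup: float) -> float:
    """Bound C = sqrt(M * zeta), where loss_sup plays the role of M in condition C.1."""
    return math.sqrt(loss_sup * zeta(n_samples, vc_dim, eta))


# Hypothetical setting: VC dimension 3, confidence level 95 %, loss bounded by 1.
sample_sizes = (100, 1_000, 10_000, 100_000)
bounds = [bound_delta_1(n, vc_dim=3, eta=0.05, loss_sup=1.0) for n in sample_sizes]
```

With these values the bound shrinks monotonically with the sample size, roughly like $\sqrt{\ln N / N}$, and a larger VC dimension $q$ slows that decay.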



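Theorem 7 (IPERM applicability and VC (2)). Let $\mathcal{E} = (\mathcal{T}, OM, z, P_z)$ be a probabilistic environment and, associated with it, a learning machine $\mathcal{LM} = (\mathcal{H}, \mathcal{A})$. Let $\Upsilon_N$ be a finite sequence of $N$ training examples from the environment $\mathcal{E}$ and $\eta$ a real number in the interval $]0, 1[$. If the conditions C'.1, C.2 and C.3 are satisfied, then the IPERM is $\delta_2$-applicable to $(\mathcal{E}, \mathcal{LM})$ with the bound:

$$C = \gamma(s)\, \tau \sqrt{\zeta} \qquad (44)$$

where:

- $\gamma(s) = \left( \frac{1}{2} \left( \frac{s - 1}{s - 2} \right)^{s-1} \right)^{1/s}$
- The number $\zeta$ is:

$$\zeta = 4\, \frac{q \left( \ln \frac{2N}{q} + 1 \right) - \ln \frac{\eta}{4}}{N}$$

- $q$ is the VC dimension $q(l_{\mathcal{H}})$ of the space $l_{\mathcal{H}}$.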



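Proof. [Vapnik 1998] showed that, for any $\varepsilon > 0$, the following inequality holds true:

$$\Pr\Big( \sup_{h \in \mathcal{H}} \delta_2\big[R(h), R_{emp}^{\Upsilon_N}(h)\big] > \gamma(s)\, \tau\, \varepsilon \Big) < 4 \exp\left\{ \left[ \frac{q \left( \ln \frac{2N}{q} + 1 \right)}{N} - \frac{\varepsilon^2}{4} \right] N \right\} \qquad (45)$$

when conditions C'.1, C.2 and C.3 are satisfied ([Vapnik 1998], see inequalities 5.43 and 5.12 at pages 210 and 192 respectively). Set the right-hand side of the above inequality equal to $\eta$. Then the expression of $\varepsilon$ is:

$$\varepsilon = \sqrt{\zeta}$$

and, therefore, the inequality:

$$\sup_{h \in \mathcal{H}} \delta_2\big[R(h), R_{emp}^{\Upsilon_N}(h)\big] \le \gamma(s)\, \tau \sqrt{\zeta}$$

holds true with probability of at least $1 - \eta$. $\Box$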



Note that $\mathcal{WPI}$ is represented by the number $M$ in theorem 6 and by the numbers $s$ and $\tau$ in theorem 7.

The following theorem uses the weaker i.i.d. condition (C'.3):

Theorem 8 (Using condition C'.3). If the third condition C.3 in the two previous theorems 6 and 7 is replaced by the condition C'.3 and the two other conditions, C.1 and C.2 for theorem 6 and C'.1 and C.2 for theorem 7, are kept unchanged, then the IPERM is still applicable to $(\mathcal{E}, \mathcal{LM})$ with respect to the same deviation measures $\delta_1$ and $\delta_2$ and with the same bounds (42) and (44) respectively.

Proof. To prove inequalities (43) and (45), Vapnik [1982], Vapnik [1995] made use of the weaker i.i.d. condition only. As a result, these inequalities remain true if condition C.3 is replaced by condition C'.3. Consequently, the foregoing proofs of theorems 6 and 7 are still valid with condition C'.3. $\Box$

Using the preceding theorems, it is now possible to develop uncertainty models for $(\mathcal{E}, \mathcal{LM})$ with a guaranteed deviation $\varphi$ that is readily computable:



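Theorem 9 (Uncertainty Model and VC). Let $\mathcal{E} = (\mathcal{T}, OM, z, P_z)$ be a probabilistic environment and, associated with it, a learning machine $\mathcal{LM} = (\mathcal{H}, \mathcal{A})$. Let $\Upsilon_N$ be a finite sequence of $N$ training examples from the environment $\mathcal{E}$ and $\eta$ a real number in the interval $]0, 1[$. Let $h_{emp}^{\Upsilon_N}$ be a decision rule at which the empirical risk $R_{emp}^{\Upsilon_N}(h)$ reaches its minimum.

- If the conditions C.1, C.2 and C.3 are satisfied, then the inequality:

$$\big[\mathcal{D}(h_{emp}^{\Upsilon_N}, g^{\mathcal{T}})\big]^2 \le R_{emp}^{\Upsilon_N}(h_{emp}^{\Upsilon_N}) + \frac{M \zeta}{2} \left( 1 + \sqrt{1 + \frac{4 R_{emp}^{\Upsilon_N}(h_{emp}^{\Upsilon_N})}{M \zeta}} \right) \qquad (46)$$

holds true with probability of at least $1 - \eta$.

- If the conditions C'.1, C.2 and C.3 are satisfied, then the inequality:

$$\big[\mathcal{D}(h_{emp}^{\Upsilon_N}, g^{\mathcal{T}})\big]^2 \le \frac{R_{emp}^{\Upsilon_N}(h_{emp}^{\Upsilon_N})}{\big(1 - \gamma(s)\, \tau \sqrt{\zeta}\big)_+} \qquad (47)$$

holds true with probability of at least $1 - \eta$.

In these inequalities:

* $(a)_+ = \sup(a, 0)$ for any number $a \in \mathbb{R}$;
* $\gamma(s) = \left( \frac{1}{2} \left( \frac{s - 1}{s - 2} \right)^{s-1} \right)^{1/s}$;
* the number $\zeta$ is:

$$\zeta = 4\, \frac{q \left( \ln \frac{2N}{q} + 1 \right) - \ln \frac{\eta}{4}}{N} \qquad (48)$$

* $q$ is the VC dimension $q(l_{\mathcal{H}})$ of the space $l_{\mathcal{H}}$.

Proof. This theorem is a direct consequence of the preceding theorems. $\Box$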

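Theorem 9 establishes two uncertainty models, $\mathcal{UM}_1$ and $\mathcal{UM}_2$, for $(\mathcal{E}, \mathcal{LM})$. The first one, $\mathcal{UM}_1$, is based on the weak prior information $\mathcal{WPI}(1)$ and is defined by inequality (46). The right-hand side of this inequality represents the guaranteed deviation $\varphi_1$ between $h_{emp}^{\Upsilon_N}$ and $g^{\mathcal{T}}$, developed on the basis of $\mathcal{WPI}(1)$. Using this function $\varphi_1$, the uncertainty model $\mathcal{UM}_1$ can be re-written as follows:

$$\mathcal{UM}_1: \quad \big[\mathcal{D}(h_{emp}^{\Upsilon_N}, g^{\mathcal{T}})\big]^2 \le \varphi_1\big(N, \mathcal{H}, R_{emp}^{\Upsilon_N}(h_{emp}^{\Upsilon_N}), \mathcal{WPI}(1), \eta\big) \qquad (49)$$

with:

$$\varphi_1\big(N, \mathcal{H}, R_{emp}^{\Upsilon_N}(h_{emp}^{\Upsilon_N}), \mathcal{WPI}(1), \eta\big) = R_{emp}^{\Upsilon_N}(h_{emp}^{\Upsilon_N}) + \frac{M \zeta}{2} \left( 1 + \sqrt{1 + \frac{4 R_{emp}^{\Upsilon_N}(h_{emp}^{\Upsilon_N})}{M \zeta}} \right) \qquad (50)$$

The second model, $\mathcal{UM}_2$, is based on the weak prior information $\mathcal{WPI}(2)$ and is defined by inequality (47). Denoting the right-hand side of this inequality as $\varphi_2$ (guaranteed deviation developed on the basis of $\mathcal{WPI}(2)$), the uncertainty model $\mathcal{UM}_2$ can be re-written as:

$$\mathcal{UM}_2: \quad \big[\mathcal{D}(h_{emp}^{\Upsilon_N}, g^{\mathcal{T}})\big]^2 \le \varphi_2\big(N, \mathcal{H}, R_{emp}^{\Upsilon_N}(h_{emp}^{\Upsilon_N}), \mathcal{WPI}(2), \eta\big) \qquad (51)$$

with:

$$\varphi_2\big(N, \mathcal{H}, R_{emp}^{\Upsilon_N}(h_{emp}^{\Upsilon_N}), \mathcal{WPI}(2), \eta\big) = \frac{R_{emp}^{\Upsilon_N}(h_{emp}^{\Upsilon_N})}{\big(1 - \gamma(s)\, \tau \sqrt{\zeta}\big)_+}$$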
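
Both guaranteed deviations are closed-form expressions, so they can be evaluated directly once the minimum empirical risk is known. The Python sketch below is an illustration under assumptions rather than the author's implementation: it takes $\zeta = 4\,[q(\ln(2N/q) + 1) - \ln(\eta/4)]/N$ and $\gamma(s) = \big(\tfrac{1}{2}((s-1)/(s-2))^{s-1}\big)^{1/s}$, returns infinity when $(1 - \gamma(s)\tau\sqrt{\zeta})_+$ is zero, and uses hypothetical function names and input values.

```python
import math


def zeta(n_samples: int, vc_dim: int, eta: float) -> float:
    """Assumed form: zeta = 4 * (q * (ln(2N/q) + 1) - ln(eta/4)) / N."""
    growth = vc_dim * (math.log(2.0 * n_samples / vc_dim) + 1.0)
    return 4.0 * (growth - math.log(eta / 4.0)) / n_samples


def gamma(s: float) -> float:
    """Assumed form: gamma(s) = (0.5 * ((s - 1) / (s - 2)) ** (s - 1)) ** (1 / s), s > 2."""
    if s <= 2.0:
        raise ValueError("s must be greater than 2")
    return (0.5 * ((s - 1.0) / (s - 2.0)) ** (s - 1.0)) ** (1.0 / s)


def phi_1(r_emp: float, n_samples: int, vc_dim: int, eta: float, loss_sup: float) -> float:
    """Guaranteed deviation based on WPI(1): loss bounded by loss_sup (the number M)."""
    c_squared = loss_sup * zeta(n_samples, vc_dim, eta)
    return r_emp + 0.5 * c_squared * (1.0 + math.sqrt(1.0 + 4.0 * r_emp / c_squared))


def phi_2(r_emp: float, n_samples: int, vc_dim: int, eta: float, s: float, tau: float) -> float:
    """Guaranteed deviation based on WPI(2): moment ratio bounded by tau."""
    slack = 1.0 - gamma(s) * tau * math.sqrt(zeta(n_samples, vc_dim, eta))
    # (a)_+ = sup(a, 0): the bound carries no information when slack <= 0.
    return r_emp / slack if slack > 0.0 else math.inf


# Hypothetical inputs: minimum empirical risk 0.04, N = 10 000, q = 3, eta = 0.05.
deviation_1 = phi_1(0.04, 10_000, 3, 0.05, loss_sup=1.0)
deviation_2 = phi_2(0.04, 10_000, 3, 0.05, s=3.0, tau=1.5)
```

Each value exceeds the empirical risk it is built from, and the excess is the price paid for replacing the expected risk by its empirical estimate. For small samples, `phi_2` returns infinity, which signals that the second model gives no usable guarantee at that sample size.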



10. HOW TO START THE APPLICATION OF THE FRAMEWORK - EXAMPLE OF 
WASTEWATER TREATMENT PLANTS 



The reader is referred to Guergachi [1999] for an extensive discussion of the application of the mathematical framework developed in this paper. This section presents a very brief description of how the implementation of this framework can be started, by showing the process of defining the environment $\mathcal{E}$ of the studied engineering system. Wastewater treatment plants are chosen as an example to illustrate this implementation.



Defining the Environment $\mathcal{E}_{wwt}$ for a Wastewater Treatment Plant

The probabilistic environment $\mathcal{E}_{wwt}$ for a wastewater treatment plant can be an urban area, a city, a small community or a watershed. The transformer $\mathcal{T}_{wwt}$ is the wastewater treatment plant itself, which is located within the environment $\mathcal{E}_{wwt}$. This plant uses the activated sludge process to treat the wastewater generated in $\mathcal{E}_{wwt}$. The situation $v$ encompasses the inputs to the plant and the state variables of the activated sludge process. It takes all its values in a space $V$. The probability density function $P_v$ characterizes the nature and amount of uncertainty associated with the environment $\mathcal{E}_{wwt}$. Two environments $\mathcal{E}_{wwt_1}$ and $\mathcal{E}_{wwt_2}$ with similar features (population, people's customs, types of industries, climate, plant configuration, ...) would have almost the same probability density function. The outcome $w$ is the future value of one state variable of the treatment process; it can be either the substrate (i.e., the waste) concentration or the microorganism concentration. The variable $w$ takes values from some subspace $W$ of $\mathbb{R}$. The conditional probability density function, $P_{w|v}$, of the outcome $w$ given the situation $v$ is a characteristic of the plant $\mathcal{T}_{wwt}$. Two plants $\mathcal{T}_{wwt_1}$ and $\mathcal{T}_{wwt_2}$ with similar design, history, operating mode and control strategy would have almost the same conditional probability density function.

11. CONCLUSIONS 

A mathematical framework for modeling the uncertainty in complex engineering systems is developed. This framework uses the results of computational learning theory and is based on the premise that a system model is a learning machine. A definition of an uncertainty model is given and a principle called the "Inductive Principle of Empirical Risk Minimization" (IPERM) is introduced. The applicability of this principle is examined and the concept of "guaranteed deviation" is defined. The system model complexity is measured using the VC dimension. Based on this dimension, two different uncertainty models are developed.